AI GPU Ollama Continue Dev
Pull Ollama Model
# Pull the 8-bit (30GB) instruction-tuned model
ollama pull gemma3:27b-it-q8_0
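Once the pull completes, you can confirm the tag and on-disk size before going further:
ollama list | grep gemma3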
Ollama Modelfile
Modelfile.gemma-coder-v1
# Start from our new, higher-fidelity 8-bit model
FROM gemma3:27b-it-q8_0
# === Enhancement 1: Max Out The Context Window ===
# Gemma 3 27B supports 128K (131072) tokens.
# Maxing this out gives RAG tools like Continue.dev the
# largest possible "workspace" to feed your project's context.
PARAMETER num_ctx 131072
# === Enhancement 2: Advanced Coherence Sampling (Mirostat) ===
# We are replacing nucleus sampling (temp/top_k/top_p)
# with Mirostat 2.0. Mirostat targets a specific level
# of "coherence" (perplexity), which is often superior for
# long, logical, multi-step outputs like code files.
# It "learns" as it generates to stay on topic.
# Use Mirostat 2.0
PARAMETER mirostat 2
# Target coherence (5.0 is the default)
PARAMETER mirostat_tau 5.0
# "Learning rate" for the controller (0.1 is the default)
PARAMETER mirostat_eta 0.1
# (We no longer need temperature, top_k, or top_p)
# === Enhancement 3: Explicit Stop Tokens ===
# Tell the model exactly when its "turn" is over.
# This prevents it from rambling or role-playing as the user.
# "<end_of_turn>" is the specific token used by Gemma 3.
PARAMETER stop "<end_of_turn>"
PARAMETER stop "<|user|>"
PARAMETER stop "User:"
# === Enhancement 4: "v2" System Prompt ===
# We are making the prompt more aggressive by forcing
# a Chain-of-Thought and Self-Correction loop.
# We also add specific formatting instructions for multi-file code.
SYSTEM """
You are an expert-tier AI-pair programmer and software architect.
You will be given a task and context from a codebase.
Your response must be 100% code, configuration, or the requested artifact.
Do not add conversational fluff, greetings, or explanations.
Follow this multi-step reasoning process:
PHASE 1: Think step-by-step. Break down the user's request into a concrete plan.
PHASE 2: Execute the plan. Write the code or configuration to satisfy the request.
PHASE 3: Review and self-correct. Check your output for errors, optimizations, and adherence to the "100% code" rule.
When you must provide code for multiple files, use this *exact* markdown format:
--- path/to/your/file.py ---
```python
# ... your code for file.py ...
```
--- path/to/another/file.json ---
```json
{
  "key": "value"
}
```
Begin execution. """
Create the Customized Ollama Model
ollama create gemma-coder-v1 -f ./Modelfile.gemma-coder-v1
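The new model should now show up in ollama list. A quick smoke test (any short coding prompt will do) confirms the system prompt and stop tokens behave as intended:
ollama run gemma-coder-v1 "Write a bash one-liner that counts the lines in every .py file under the current directory."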
Embedding Model
ollama pull mxbai-embed-large
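If you want to sanity-check the embedder before wiring it into Continue.dev, you can call Ollama's embeddings endpoint directly; the snippet below assumes the default port and the model name pulled above:
curl http://localhost:11434/api/embeddings -d '{"model": "mxbai-embed-large", "prompt": "fn main() {}"}'
# A JSON response containing a long "embedding" array means the model is serving correctly.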
Continue Dev
Edit ~/.continue/config.yaml
# ~/.continue/config.yaml
models:
  - name: "Local Coder"
    title: "Gemma 3 27B Coder v1 (Q8)"
    provider: ollama
    model: "gemma-coder-v1:latest"
    apiBase: "http://localhost:11434"
  - name: "Local Embedder (SOTA)"
    title: "MixedBread Embed Large"
    provider: ollama
    # === THE UPGRADE ===
    model: "mxbai-embed-large:latest"
    apiBase: "http://localhost:11434"
    embed: true

# ... (rest of your config) ...
modelRoles:
  chat: "Local Coder"
  edit: "Local Coder"

contextProviders:
  - name: "code"
    params:
      embeddingsProvider: "Local Embedder (SOTA)"
  - name: "docs"
  - name: "diff"
  - name: "terminal"

# ... (rest of your config) ...
embeddingsProvider: "Local Embedder (SOTA)"
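Before reloading the Continue.dev extension, it is worth confirming that Ollama is reachable at the apiBase above and exposes both models under exactly these names:
curl -s http://localhost:11434/api/tags
# The response should list gemma-coder-v1:latest and mxbai-embed-large:latest.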
Domain-Tuning with (Q)LoRA
First, let's clarify the terms.
- LoRA (Low-Rank Adaptation): This is the core technique. Instead of re-training the entire 30GB model, you "freeze" it and train a tiny new set of weights (the "adapter," often < 100MB). This adapter "steers" the model's output toward a specific task (e.g., writing only Rust code, or adopting a specific persona).
- QLoRA (Quantized LoRA): This is an efficiency technique for training LoRAs. It quantizes the base model (e.g., to 4-bit) during training, which dramatically lowers the VRAM needed to perform the fine-tuning.
For your goal (applying a pre-trained adapter), you only need to care about the LoRA file itself.
How to Apply LoRA Adapters in Ollama
Ollama makes this incredibly simple using the ADAPTER instruction in a Modelfile. This "melds" the adapter with the base model at runtime.
Let's build a new model, gemma-coder-v1-rust, that's specifically tuned for Rust development.
- Step 1: Get a LoRA Adapter. You'll find these adapters on Hugging Face. The key is to find adapters in GGUF format, as these are compatible with Ollama.
Go to: Hugging Face Collections (ggml-org)
Search for: Adapters trained for your specific domain (e.g., "code," "python," "rust"). Let's pretend we found a hypothetical adapter file named rust-code-lora.gguf.
- Step 2: Create a New "Specialist" Modelfile. Create a new file named Modelfile.gemma-rust:
Modelfile.gemma-rust
# Start from our high-fidelity v1 coder model
FROM gemma-coder-v1
# === Apply the LoRA Adapter ===
# This is the magic. Ollama will load gemma-coder-v1
# and then "attach" this adapter on top of it.
# This must be a path to the .gguf file.
ADAPTER ./rust-code-lora.gguf
# We can even override the system prompt to be more specific
SYSTEM """
You are an expert-tier Rust and systems programming AI.
You will be given a task related to Rust, Cargo, or backend development.
Your response must be 100% code, configuration, or the requested artifact.
You must adhere to Rust idioms and best practices, paying close
attention to the borrow checker, memory safety, and performance.
Follow this multi-step reasoning process:
PHASE 1: Think step-by-step.
PHASE 2: Write the Rust code.
PHASE 3: Review and self-correct, checking for common Rust errors.
Begin execution.
"""
- Step 3: Build the Specialist Model
ollama create gemma-coder-v1-rust -f ./Modelfile.gemma-rust
You now have another model in your ollama list. When you run gemma-coder-v1-rust, it will load the 8-bit base model and the Rust adapter, giving you hyper-specialized, domain-specific output. You can repeat this for Python, Go, TypeScript, etc., creating a small, efficient "specialist" model for every language in your stack.
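A quick way to check that the adapter actually loaded is to give it an arbitrary Rust task and confirm the output is idiomatic Rust:
ollama run gemma-coder-v1-rust "Write a Rust function that parses a comma-separated list of integers from a &str."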
Edit your Continue.dev config to add gemma-coder-v1-rust.
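Following the same pattern as the entries above, add another item under models: in ~/.continue/config.yaml (the name and title strings here are just suggested labels):
  - name: "Local Rust Coder"
    title: "Gemma 3 27B Rust Specialist (Q8)"
    provider: ollama
    model: "gemma-coder-v1-rust:latest"
    apiBase: "http://localhost:11434"
You can then point modelRoles at "Local Rust Coder" (or select it from Continue's model picker) whenever you are working in a Rust project.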
For NVIDIA GPUs (Linux/WSL):
Ollama auto-detects your NVIDIA driver (nvidia-smi) and CUDA. The key is to tell it how many layers to offload. We do this by editing the systemd service.
Open the service file for editing:
sudo systemctl edit ollama.service
This will open a blank override file. Paste the following:
[Service]
# OLLAMA_NUM_GPU controls how many layers are offloaded.
# Setting it to a high number (e.g., 99) tells Ollama to offload
# as many layers as your VRAM can possibly fit.
Environment="OLLAMA_NUM_GPU=99"
# Optional: If you have multiple GPUs, uncomment this
# to pin Ollama to the first one (index 0).
# Environment="OLLAMA_MAIN_GPU=0"
Save the file, then reload and restart Ollama:
sudo systemctl daemon-reload
sudo systemctl restart ollama
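To confirm systemd actually picked up the override (these are standard systemctl commands, nothing Ollama-specific):
systemctl show ollama --property=Environment
# Should print a line containing OLLAMA_NUM_GPU=99.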
Verify: Run your model and watch your VRAM usage:
ollama run gemma3:27b-it-q8_0 "Why is the sky blue?" &
watch -n 1 nvidia-smi
You should see your GPU's VRAM fill up and the "GPU-Util" climb.
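You can also ask Ollama directly how the loaded model was split between CPU and GPU:
ollama ps
# The PROCESSOR column should read "100% GPU" when every layer fits in VRAM.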